Resource efficient aortic distensibility calculation by end to end spatiotemporal learning of aortic lumen from multicentre multivendor multidisease CMR images

Aortic distensibility (AD) is important for the prognosis of multiple cardiovascular diseases. We propose a novel resource-efficient deep learning (DL) model, inspired by the bi-directional ConvLSTM U-Net with densely connected convolutions, to perform end-to-end hierarchical learning of the aorta from cine cardiovascular MRI towards streamlining AD quantification. Unlike current DL aortic segmentation approaches, our pipeline: (i) performs simultaneous spatio-temporal learning of the video input, (ii) combines the feature maps from the encoder and decoder using non-linear functions, and (iii) takes into account the high class imbalance. By using multi-centre multi-vendor data from a highly heterogeneous patient cohort, we demonstrate that the proposed method outperforms the state-of-the-art method in terms of accuracy and at the same time it consumes \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sim$$\end{document}∼ 3.9 times less fuel and generates \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sim$$\end{document}∼ 2.8 less carbon emissions. Our model could provide a valuable tool for exploring genome-wide associations of the AD with the cognitive performance in large-scale biomedical databases. By making energy usage and carbon emissions explicit, the presented work aligns with efforts to keep DL’s energy requirements and carbon cost in check. The improved resource efficiency of our pipeline might open up the more systematic DL-powered evaluation of the MRI-derived aortic stiffness.


Related work
Two recent papers deployed deep learning (DL)-based approaches to fully automate the aortic lumen segmentation from cine CMR images 24,25 .DL or hierarchical representations learning is a rapidly growing branch of machine learning in which the models learn complex raw-input-data representations that are tuned to the specific task at hand 26 .The deployment of DL models has been a game changer in computer vision (among a wide range of other industries) as it has permitted to achieve or even surpass human-level performance in a number of visual tasks including image classification, object detection, and semantic segmentation 27 .Recent research has highlighted the superb capabilities of DL models to analyse CMR images 28,29 .However, the two aorta DL studies treated the task at hand as a sparse annotation problem and evaluated their method only on a very limited number of cardiac cycle time frames (namely the end diastole (ED) and end systole (ES)) for which the ground truth labels were available.The targets of the remaining time frames (that were also used to map the input data during training) could not be visually validated and were acquired by pipelines which are likely to suffer from registration errors or poor convergence of the active contour algorithm.On top of this, the previous studies either completely ignored the temporal continuity inherent in the cine image sequence by treating the image segmentation task as static 25 , or simulated the use of time by stacking the recurrent part after the convolutional layers 24 .However, correlated spatio-temporal features cannot be learnt when spatial and temporal features are explicitly determined in separate regions of the network 30 .In addition, the previous papers combined the feature maps from the encoder and decoder using simple concatenation.Nonetheless, it has been argued that such a practice results in less precise segmentation than when non-linear functions are employed for this task 31 .Next, the loss function used by previous studies neglected the fact that the region of interest (ROI) class is significantly smaller than the background class.Moreover, the previous papers analysed datasets acquired by following a single data acquisition protocol on a relatively healthy cohort.Finally, previous work completely ignored the resource efficiency matter of the proposed pipelines.However, this is an important issue as the substantial computation and energy demands by DL models go together with a considerable environmental and financial cost 32 .Improving the efficiency of algorithms should be a high priority in DL research alongside accuracy 32 .

Overview of the proposed method
In this paper, we propose to enhance the AD calculation by performing aortic lumen segmentation throughout the cardiac cycle using of a novel resource-efficient spatio-temporal DL model, inspired by the bi-directional ConvLSTM (BConvLSTM) U-Net with densely connected convolutions 31 .Our network is trained from scratch and evaluated on an aortic cine CMR dataset that was fully annotated by experts.The present paper is the first (1) AD 10 −3 mmHg −1 = A max − A min A min × PP work to perform end-to-end (i.e., over the entire cardiac cycle) hierarchical learning and testing of the aortic lumen area from cine CMR images.Our approach joins the temporal with the spatial processing of the video input by merging the encoder and decoder feature maps through a BConvLSTM (non-linear) unit 33 .We employ the focal Tversky loss during training which is better suited for problems with a high class imbalance in the data 34 .We use multi-centre multi-vendor data from a highly heterogeneous patient cohort which significantly adds to the generalisation power of the proposed aortic lumen segmentation algorithm.Finally, the network we propose in this study is resource-efficient to help promote environmentally friendly and more inclusive DL research and practices.We perform quantitative evaluations of the energy consumption and carbon cost during training.It is worth noting that resource-efficient approaches are also more well-suited towards deployment in hardwareconstrained platforms, which in turn will open up the more widespread DL-powered evaluation of the CMRderived aortic stiffness.To examine the impact brought by each contributing factor, we perform ablation studies.

Model accuracy
Table 1 details the absolute errors in aortic area (in mm 2 ) and AD (in mmHg −1 ) as well as the Dice coefficient for both AAo and DAo.The proposed method attained lower absolute area and AD errors and higher Dice coefficient values in comparison to both the state-of-the-art (SOTA) and unpruned methods.We have observed every model reaching a plateau before 250 epochs.Figures 1, 2 and 3 display the (maximum and minimum) area and AD Bland-Altman analysis plots for the proposed, SOTA and unpruned methods, respectively, versus the ground truth.It can be seen from the the limits of agreement of the Bland-Altman analysis plots versus the ground truth that the maximum and minimum aorta area as well as the AD values predicted by the proposed method had, on average, ∼ 3.9 times less fluctuation than the SOTA and unpruned methods.Figure 4 qualitatively compares the proposed and SOTA networks for three cases.Case 1 shows an instance where the SOTA method severely underestimated the number of pixels that had to be assigned to the aorta.Case 2 is a patient where the SOTA method over-segmented the AAo.Case 3 is a rare instance where the DAo is in an abnormal position relative to the AAo and for which the SOTA method was unable to identify the AAo.Our method accurately segmented both AAo and DAo for all three cases (Table 2).Table 3 shows the results of the qualitative and quantitative comparisons of ascending and descending aorta time-area curves for one cardiac cycle of a representative case.It is apparent that the cardiac cycle phases of the proposed and the ground truth curves are in much closer agreement (for both AAo and DAo) than in the SOTA cases.

Model resource efficiency
Table 2 gives the predicted carbon dioxide equivalent (CO 2 eq) emissions (in g), the consumed energy (in kWh) and the equivalent distance (in km) a car could travel during the final training (250 epochs) for the proposed, SOTA and unpruned models.Also listed in Table 2 are the training time (in h:min:s) and the average inference Table 1.Quantitative end-to-end performance of the proposed method using the absolute errors in aortic area and aortic distensibility (AD) and the Dice coefficient.Also provided are the p values obtained from the Wilcoxon-signed rank test with Bonferroni correction ( α = 0.05).The mean ground truth area (averaged over time and patients) was 678.826 mm  time (in ms).The proposed method was ∼ 2.8 times more environmentally friendly and it consumed ∼ 3.9 times less fuel when compared with SOTA.In addition, it required ∼ 5.2 times less time to train, and its inference time was ∼ 2.7 times faster.Our method was also ∼ 5.2 times less polluting and it consumed ∼ 6.6 times less fuel when compared with the unpruned method.Lastly, it required ∼ 9.3 times less time to train, and its inference time was ∼ 4.4 times faster than the unpruned version.

Ablation study
In this section, we carried out ablation experiments over a number of features of the proposed framework to better appreciate their proportionate significance.In particular: No full labels: is trained using labels obtained by propagating ES and ED labels using non-rigid image registration.

Figure 1.
Bland-Altman analysis for graphically comparing the proposed method to the ground truth with respect to aorta maximum areas, aorta minimum areas and aortic distensibility (AD) values.Y-axis gives the difference between the two methods whereas X-axis represents their mean.Area is measured in mm 2 .AD is measured in 10 −3 ×mmHg −1 .SD is the standard deviation.
No focal Tversky: is trained using Dice coefficient loss, ignoring the severe class imbalance in the data.No non-linearities: is trained by merging the encoder and decoder feature maps through linear concatenation.
No dense pruning: is trained using three densely packed convolutional blocks in the final encoding step.No filter pruning: is trained using four times more filters in the convolutional layers than the proposed model.No BConvLSTM pruning: is trained using BConvLSTM unit in three steps instead of one.To explore the effect of each factor on model accuracy, we employed the same evaluation metrics that were used in Table 1 and the same clinical images and hyperparameters that were used in the proposed framework.Table 4 gives the ablation results.It can be seen that not using non-linearities caused the largest drop in AD calculation performance.It is surprising that increasing the model size did not lead to accuracy improvement.This was possibly due to overfitting.In addition, not using focal Tversky loss and the full labels had more significant Bland-Altman analysis for graphically comparing the SOTA method 24 to the ground truth with respect to aorta maximum areas, aorta minimum areas and aortic distensibility (AD) values.Y-axis gives the difference between the two methods whereas X-axis represents their mean.Area is measured in mm 2 .AD is measured in 10 −3 ×mmHg −1 .SD is the standard deviation.impact on the AAo AD calculation, most likely due to the increased number of false positives caused by surrounding structures.Lastly, it can be seen that the drop in performance for both the absolute error in area and Dice coefficient are large and uniform for every row in Table 4, whereas the absolute error in AD is a lot more variable.to the ground truth with respect to aorta maximum areas, aorta minimum areas and aortic distensibility (AD) values.Y-axis gives the difference between the two methods whereas X-axis represents their mean.Area is measured in mm 2 .AD is measured in 10 −3 ×mmHg −1 .SD is the standard deviation.
AAo is the ascending aorta, DAo is the descending aorta, SD is the standard deviation and SOTA represents the state-of-the-art method 24 .
SOTA denotes the state-of-the-art method 24 .

Ethical approval and consent to participate
Each study was approved by the UK national research and ethics service and written informed consent was obtained from all subjects prior to participation.

Discussion
The elevated aortic stiffness has been linked to an increased risk of cardiovascular diseases.In this study, a novel resource-efficient DL model was proposed for the fast, robust, and fully automated segmentation of the AAo and the DAo from aortic cine CMR images towards streamlining quantification of AD.

Strengths of this study
This study has a number of strengths compared to previous work.We resolved to not apply the semantic segmentation algorithm to each time frame of the cardiac cycle as a separate entity.Instead, we incorporated information related to both space and time into our task by merging the encoder and decoder feature maps in the second up-sampling layer through a non-linear function, namely a BConvLSTM.The use of the hyperbolic tangent function for combining the output of the forward and backward paths helps the network learn complex data structures 31 .To address the issue that the foreground is significantly smaller than the background class, we employed the focal Tversky loss during training.Following the enhanced credibility of the ground truth targets in all cardiac cycle time frames, we performed end-to-end hierarchical learning and testing of the aortic lumen area from cine CMR images, rather than focusing on a very limited range of time frames in the cardiac cycle.
Another strength of this study is that we used multi-centre multi-vendor data from a highly diverse patient cohort which significantly adds to the ability of the proposed aortic lumen segmentation algorithm to generalise.The site, vendor, and patient heterogeneity when testing a model are crucial to the effective clinical implementation and to get approval by accreditation agencies.Unlike the model that motivated this paper 31 , our algorithm uses a building block that returns the sequence of feature maps over all time steps since we are dealing with video inputs.In addition, and in pursuit of a resource-efficient architecture, our algorithm: (i) has four times fewer filters in the convolutional layers, (ii) involves BConvLSTM layers in three times fewer steps, and (iii) contains only one third of densely packed convolutional blocks in the final encoding step.

Main findings
The AAo and DAo segmentation masks predicted by our model were in close agreement with the semi-automated annotations by the CMR experts, as evidenced by the low area errors and the high Dice coefficient values.This finding corroborates that only one densely packed convolution block in the final layer of the contracting path sufficed to learn diverse features 35 .The proposed method was compared to other DL models that represent the SOTA and unpruned methods 24,31 .The quantitative analysis showed that our model outperformed both SOTA and unpruned methods in terms of segmentation accuracy, attaining lower aortic area and AD absolute errors and higher Dice coefficient values for both AAo and DAo.Furthermore, the limits of agreement of the Bland-Altman analysis plots versus the ground truth revealed that the maximum and minimum aorta area as well as the AD values predicted by the proposed method had, on average, ∼ 3.9 times less fluctuation than the SOTA and unpruned methods.These findings are in keeping with recent studies 36 , which reported that more powerful class-discriminative features can be captured in an efficient manner by tying up the temporal with the spatial processing in video inputs.Importantly and as an indication of an enhanced ability to generalise, our method, unlike SOTA , was found to accurately segment the aortic areas for a rare (and unseen by the algorithm) case for which the DAo was located at an abnormal position with respect to the AAo.Such an infrequent condition can be observed in individuals with scoliosis or patients with aortic arch branching and orientation abnormalities.Unlike previous studies 24,25 , the performance of the evaluated models on the AAo was slightly inferior to that on the DAo.A reason for this observation might be that the structures surrounding the AAo seem to be more challenging to differentiate between them in our diverse dataset.The consistency of the proposed segmentation method across the cardiac cycle was double-checked by demonstrating AAo and DAo time vs. area curves and estimating curve distances from the ground truth for a representative case.Finally, the proposed method was compared to SOTA and unpruned methods also in terms of resource-efficiency and carbon footprint using the Table 3. Qualitative and quantitative comparisons of ascending and descending aorta time-area curves for one cardiac cycle of a representative case.SOTA denotes the state-of-the-art method 24 .Bold face indicates best performance.www.nature.com/scientificreports/Carbontracker tool 37 .Our method was found to consume ∼ 3.9 times less fuel and it was ∼ 2.8 less polluting during training than SOTA.In addition, it required ∼ 5.2 times less time to train, and its inference time was ∼ 2.7 times shorter than SOTA.As expected, our method was also ∼ 5.2 times less polluting and it consumed ∼ 6.6 times less fuel when compared with the unpruned method.Lastly, it required ∼ 9.3 times less time to train, and its inference time was ∼ 4.4 times faster than the unpruned version.Ablation studies showed that using BConvLSTM non-linearities gave the largest boost in AD calculation accuracy.Increasing the number of model parameters led to overfitting.Lastly, not using focal Tversky loss and the full labels had more significant impact on the AAo AD calculation, most likely due to the increased number of false positives caused by surrounding structures.

Clinical implications
Previous experiments conducted in our lab have shown that the AD aortic stiffness measure has better reproducibility than the pulse wave velocity 38 .The excellent results presented in this paper allow to significantly reduce the time spent on extracting aortic structural and functional phenotypes from CMR data, while also improving the reliability of the results.Therefore, our pipeline could provide a valuable resource for exploring genome-wide associations of the AD and aortic areas with the cognitive performance for very large-scale biomedical databases (such as the UK Biobank) 39,40 , which better represent the wider population.Obtaining quantitative CMR phenotypes at such a scale remains a challenge nowadays.This type of analysis would enable us to investigate possible causal relationships of the aortic measures with aortic aneurysms and brain small vessel disease, as well the bidirectional relationship with blood pressure indices.The definition of the responsible mechanisms would eventually lead to (i) improved understanding of the factors that contribute to cognitive decline and dementia, and (ii) identification of new therapeutic targets.Moreover, the model developed for the AD quantification task could be reused as the starting point for various clinically relevant tasks (through transfer learning) to allow rapid progress and enhanced performance.The proposed framework could easily be integrated within a CMR analysis software.We have shared the code of our image analysis pipeline online (https:// github.com/ tuana qeelb ohoran/ Aortic-Diste nsibi lity.git).For our method to substantially impact routine clinical care, appropriate infrastructure for deployment and evaluation within the healthcare centre is needed.This ideally should include ML-ops platforms, pipelines for ensuring safety standards through identifying potential (such as methods for recognizing when the algorithm is operating in a more complex landscape than the one used for training or model explainability methods that highlight decision-relevant parts of feature representations).

Resource efficiency considerations
It was found that the computations required for DL research have grown 300,000-fold from 2012 to 2018 41 , which is much faster than the rate at which compute demand has historically increased.These computations require staggering amounts of energy for fuelling them 32 .Given that electricity usage is correlated with greenhouse gas emissions, the associated carbon footprint is also outsized which appears to accelerate global climate change 32 .Even though the biggest part of this problem is caused by the very large-scale models used in natural language processing applications, the energy consumed by typical medical image analysis DL models is far from negligible.We estimated that the energy required to train our model was 5.984 KWh, as opposed to 23.183 KWh for the SOTA model.The respective generated carbon emissions were equivalent to those produced when a car travels a distance of 17.388 km for our model versus 48.052 km for SOTA.However, it is important to understand that those figures refer to the final training only.Before the final training run, resolving the optimal model in a Table 4. Ablation over number of features using the proposed framework."No full labels" is trained using labels obtained by propagating ES and ED labels using non-rigid image registration."No focal Tversky" is trained using Dice coefficient loss, ignoring the severe class imbalance in the data."No non-linearities" is trained by merging the encoder and decoder feature maps through linear concatenation."No dense pruning" is trained using three densely packed convolutional blocks in the final encoding step."No filter pruning" is trained using four times more filters in the convolutional layers than the proposed model."No BConvLSTM pruning" is trained using BConvLSTM unit in three steps instead of one..
The significance of the dire numbers mentioned above is colossal, especially after taking the worldwide adoption of healthcare DL applications into consideration.Despite the fact that few concerns involving the energy usage and carbon footprint of DL research started to emerge 32,37,42 , the overwhelming majority of DL research in computer vision for healthcare are only concerned with enhancing accuracy while ignoring resource efficiency.
In this paper, we made energy usage and CO 2 eq emissions explicit.We endorse prior calls for making those key metrics evaluation criteria alongside accuracy-related measures in DL research, as a means to accelerate innovations in DL algorithmic efficiency 32,[42][43][44][45] .The value of DL models should be judged by the amount of intelligence they provide per joule.Such an initiative would ultimately help: (i) reduce the adverse environmental impact of DL research during training and development, and (ii) make DL research more inclusive by letting more people participate 45 .With the spectre of an energy-hungry future looming alongside the escalating rates of natural disaster, it is imperative to explore ways to keep DL's energy usage and carbon cost in check 32 .Finally, it is worth noting that the resource efficiency element is appealing also because it might increase the portability and practicality of our approach by promoting its wider distribution to devices with lower computational power and memory, which in turn might open up the more universal and systematic DL-powered evaluation of the CMR-derived aortic stiffness.

Study limitations
This study has several limitations.Despite the large and diverse training dataset, our network is likely to produce less-accurate results if it is presented with pathologies, ages, ethnic backgrounds, and scanners that have not been included in the training set.In fact, this is the main limitation hampering the deployment of any DL model in the real world.To deal with this issue to a certain extent, we employed data augmentation techniques that simulate various possible data distributions.Altogether, it was illustrated that our pipeline exhibited improved ability to generalise to an unseen non-representative case compared to SOTA.Another main limitation of our model is that it lacks interpretability due to the intrinsic "black-box" nature of the DL algorithms.Explainable tools are a prerequisite for building trust in DL models, and the development of such a module will be a topic of future research 46 .Furthermore, our model has not been tested against adversarial attacks 47 .Even though Carbontracker supports a variety of different environments and platforms, there might be small deviations of the reported energy consumption and carbon emission values from the true ones.Such deviations might be caused by the quality of the estimated carbon intensity of electricity production which varies by geographic location (depending on the energy sources that power the local electrical grid) and time throughout the day (as energy demand and capacity change).However, all DL experiments in this study were conducted at the same workstation and time period during the day.In addition, Carbontracker uses "real-time" carbon intensity values which are fetched every 15 min during training using the application programming interfaces supported by this tool.In this study, the predicted maximum and minimum aortic areas, that were also used for the quantification of AD, did not always relate with the diastolic and systolic cardiac cycle phases, respectively, of the handcrafted analysis.However, this did not significantly affect the AD measurements.Our study required full annotation of all temporal frames in the cine dataset, as opposed to SOTA that required only sparse annotation.However, the SOTA method also needs extra resources (people, time) to identify ED and ES phases.It also needs significant resources (hardware, time) to perform the very computationally-intensive non-rigid registration.Lastly, the annotation in our study was not so time-consuming because it was largely supported by the JIM software that performed automated label propagation, which was then verified by the clinicians.Therefore, the consensus on the standards is that our study is concerned only with training resources.Finally, other recent fully-automated segmentation approaches do exist 48,49 , that could potentially perform better than ours in the aortic lumen delineation task from cine CMR images.However, the goal of this work was to propose a DL-based framework that surpasses the SOTA methods for this particular task while being more resource-efficient, rather than to carry out an exhaustive survey of semantic segmentation, the literature of which is huge.

Conclusion
This paper proposed a novel resource-efficient DL model for the fast, robust, and fully-automated segmentation of the AAo and DAo from aortic cine CMR images towards streamlining quantification of AD.When evaluated on a large multi-centre multi-vendor dataset from a highly heterogeneous patient cohort, the proposed method outperformed the SOTA method in terms of accuracy and at the same time it consumed ∼ 3.9 times less fuel and generated ∼ 2.8 less carbon emissions.Notably, the proposed method was even more accurate than the unpruned method.Our model could provide a valuable tool for exploring genome-wide associations of the AD and aortic areas with the cognitive performance in very large-scale biomedical databases.By making energy usage and greenhouse gas emissions explicit, the presented work aligns with efforts to keep DL's energy requirements and carbon cost in check.The improved resource efficiency element of our pipeline might open up the more universal and systematic DL-powered evaluation of the CMR-derived aortic stiffness.

Study population and image dataset
The study population comprises participants from four clinical studies analysed at the University Hospitals of Leicester NHS Trust MRI core lab which included AD assessment.These included participants with spontaneous coronary artery dissection 50 , asymptomatic type 2 diabetes (from Lydia 51 and DIASTOLIC 52  www.nature.com/scientificreports/(the Pathway 2 study) and healthy volunteers (recruited in above studies 50,51 ).In total, we analysed 424 aortic MRI datasets taken from 376 patients.The demographic, anthropometric and clinical characteristics of the participants are presented in Table 5.Each study was approved by the UK national research and ethics service and written informed consent was obtained from all subjects prior to participation.All methods were performed in accordance with the relevant guidelines and regulations.

MRI acquisition
Acquisitions were undertaken on either 1.5T or 3T scanners from three centres using four different MRI scanners: Leicester (Siemens Aera, 1.5T and Skyra, 3T), Cambridge (GE Signa, 1.5T) and Dundee (Siemens Tri-oTim, 3T).SSFP cine images of the AAo and DAo in a plane perpendicular to the thoracic aorta at the level of the pulmonary artery bifurcation were reconstructed to 30-40 phases as previously described 20,38 .The typical image matrix was 256 × 186 to 256 pixels.The in-plane pixel height and width varied between 1.093mm and 1.914mm.The slice thickness for cine imaging was 8mm.Brachial blood pressure was measured simultaneously to determine pulse pressure.

Data pre-processing and annotation
The end-to-end data annotation was carried out semi-automatically by experts from the Glenfield hospital in Leicester using the Java Image Manipulation Software Version 6 (Xinapse Systems Ltd, Essex, UK) 53 as previously described 20,38 .The ascending and descending aorta was analysed by manually contouring the first and last phase and every 6th phase in between.Then images were propagated to fill in the other phases of the sequence.All phases were then manually checked and adjusted if required.The experts were blinded to the patient details.
All MRI images and masked images (annotations) were zero-padded to 256 × 256 pixels so that their dimensions match.

Neural network architecture
The proposed architecture (Fig. 5) was inspired by the BConvLSTM U-Net with densely connected convolutions 31 .It consists of: (i) a contracting path (a.k.a. the encoder) for capturing the context in the image by transforming it into a high-level feature representation, and (ii) a symmetric expanding path (a.k.a. the decoder) for interpreting the feature maps, enabling precise localisation (where in the image) and producing a full resolution segmentation map.There are four down-sampling layers in the encoder and four up-sampling layers in the decoder.Each encoding step consists of a sequence of two convolutional layers (3 × 3 filters and a ReLU non-linear activation 54 ) followed by a Batch Normalisation (BN) layer 55 and a 2 × 2 max-pooling.BN was employed so that the optimiser converges faster when training the network, whereas the max-pooling operation was used to aggressively down-sample the feature maps.To reduce over-fitting, dropout regularisation 56 was employed in the last two steps of the contracting path before the BN layer.The number of filters that each encoding layer computes over the input doubles at each step ( [16, 32, 64, 128, 256]).
Each decoding layer consists of a sequence of two convolutional layers (3 × 3 filters and a ReLU non-linear activation) apart from the final decoding step that has five convolutional layers.In each step of the expanding path, the output of the previous layer is passed onto an up-conv layer (i.e.up-sampling function followed by a 2 × 2 convolution; this process doubles the size of the feature map) and then combined with the corresponding (same-resolution-level) representation in the contracting path using skip connections.The combination of these two types of feature maps is a channel-wise concatenation in all steps except for the second up-sampling layer, where we propose to merge them in a more complex way using (apart from concatenation also) a BConvLSTM 33 building block that outputs information about all temporal hidden states.This is to account for the spatiotemporal composition of the input.The BConvLSTM building block is made of two ConvLSTMs.ConvLSTM is a variant of the traditional LSTM neural network that is specifically designed to process spatial data.By incorporating convolutional structures within the LSTM gates, ConvLSTM is capable of capturing spatial dependencies in the data.Figure 5  ← − H y denote the forward and backward convolution kernels corresponding to the hidden states, b represents the bias term, and the hyperbolic tangent was employed to combine the outputs of the forward and backward paths in a non-linear way.The mathematical equations for obtaining each of the two hidden state tensors are as follows: Here, I t , F t , and O t denote the input, forget, and output gates, respectively, at time t , while C t and σ represent the cell state and the sigmoid function, respectively, and x t refers to the input at time t (Fig. 5).The input gate gov- erns the information that is retained in the cell state.The forget gate regulates the information that is discarded from the cell state.The output gate controls the information utilized to compute the output of the LSTM.The cell state maintains the internal state of the ConvLSTM.The * operator symbolises the convolution operation, and • represents the Hadamard product (element-wise multiplication).
The number of channels reduces in every step of the expanding path ([256, 128, 64, 32, 16, 1]), whereas the size of the feature maps progressively increases to reach the input size after the final layer.
Unlike the Azad et al. 31 that inspired this work, our method uses a BConvLSTM building block that returns the sequence of feature maps over all time steps since the dataset was time distributed.In addition, and in pursuit of a more efficient architecture, our method: (i) uses four times less filters in the convolutional layers compared to Azad et.al. 31, (ii) involves BConvLSTM in only one step, (iii) contains only one densely packed convolutional block in the final encoding step.

Implementation and training
The dataset was randomly split into a training set (272 datasets), validation set (68 datasets) and a testing set (84 datasets).For training, we used the Adam optimiser 57 for 250 epochs with a constant learning rate of 0.001 and a batch size of 120 (approximately 4 patients).The dropout value that we used was 0.5.The initialiser of the network was He Normal 58 .In order to improve the proposed model's ability to generalise, online data augmentation techniques were employed.The augmented data were obtained from the original images by applying random rotations (by a degree between -30 • and +30 • ) and random translations along the x-and/or y-axis in either direction (by up to 20 pixels).All hyperparameters were tuned using grid search based on the validation accuracy.To address the severe class imbalance between pixel values 0 and 1 in each frame, we utilised the Focal Tversky loss function defined as where γ is the adjustable focussing parameter and Tversky Loss is the Tversky index given by where p 0i is the probability that pixel i belongs to the aorta and p 1i is the probability of pixel i being in the background class 59,60 .In addition, g 0i is 1 for the aortic vessel segmentation area and 0 for the background, and the opposite is true for g 1i .Finally, α and β are variables in Tversky Loss, which control the magnitude of the penalties for false positives and false negatives, respectively.To improve model convergence and the recall rate 59,60 , we trained our model with α = 0.8, β = 0.8 and γ = 1.All the methods were trained using the TensorFlow environment.

Model evaluation and statistical analysis
For evaluating the automated segmentation masks produced by our method relatively to the ground truth, we employed the Dice coefficient as well as the absolute area error (in mm 2 ) and absolute AD error (in mmHg −1 ).
In addition, we used Bland-Altman analysis for assessing the agreement between (maximum and minimum) aorta areas and AD values 61 .The temporal fidelity of the segmentation performance across a cardiac cycle was assessed both qualitatively and quantitatively using the Fréchet, Hausdorff and dynamic time warping (DTW) distances for a representative case.To examine the impact brought by each contributing factor, we performed ablation studies.All analyses described above was performed for both the AAo and DAo.The reported results refer to the test set.
For evaluating the resource efficiency of our method, we calculated the CO 2 eq emissions (in g) and energy spent (in kWh) during training using the Carbontracker method 37 .Carbontracker is an open-source tool written in Python for tracking and predicting the energy consumption and carbon emissions of training DL models for a given GPU.To put the carbon footprint produced during model training in context, we also reported the equivalent distance travelled by car 62 that would generate the same emission volume.The training and inference times were also computed.For comparison purposes, all evaluation metrics of the proposed method were juxtaposed with those produced by the SOTA 24 and unpruned 31 methods, also trained on the same multi-centre, multi-vendor multi-disease CMR dataset following similar training and hyperparameter tuning procedures as those described above.To determine if there is a statistically significant difference (at level 0.05) between the performances of the proposed and SOTA methods, we used the Wilcoxon signed-rank test with Bonferroni correction.

Figure 2 .
Figure 2.Bland-Altman analysis for graphically comparing the SOTA method24 to the ground truth with respect to aorta maximum areas, aorta minimum areas and aortic distensibility (AD) values.Y-axis gives the difference between the two methods whereas X-axis represents their mean.Area is measured in mm 2 .AD is measured in 10 −3 ×mmHg −1 .SD is the standard deviation.

Figure 3 .
Figure 3. Bland-Altman analysis for graphically comparing the unpruned method31 to the ground truth with respect to aorta maximum areas, aorta minimum areas and aortic distensibility (AD) values.Y-axis gives the difference between the two methods whereas X-axis represents their mean.Area is measured in mm 2 .AD is measured in 10 −3 ×mmHg −1 .SD is the standard deviation.

Figure 4 .
Figure 4. Qualitative comparison of the proposed method with the state-of-the-art (SOTA) method24 in one time frame of the cardiac cycle for three representative cases.First column: MRI.Second column: Ground truth (semi-automated segmentation).Third column: Segmentation results of the SOTA method24 .Fourth column: Segmentation results of the proposed method.The yellow arrows indicate the errors in segmentation.The ascending aorta (AAo) is denoted by the red area and the descending aorta (DAo) is denoted by the green area.
provides a visual representation of the ConvLSTM block operation.The ConvLSTM block's Vol:.(1234567890)Scientific Reports | (2023) 13:21794 | https://doi.org/10.1038/s41598-023-48986-6www.nature.com/scientificreports/distinctiveness lies in its ability to apply convolutional operations within the gates, enabling the capture of spatial dependencies in the data.Then, the BConvLSTM output Y j at the time step j is calculated as where − → H and ← − H denote forward and backward hidden state tensors 31 , respectively, W − → H y and W

Figure 5 .
Figure 5.The proposed deep learning model, including two diagrams that illustrate how the temporal dimension is handled.

Table 2 .
Resource efficiency evaluation of the proposed method using the generated carbon emissions, the consumed energy and the equivalent distance a car could travel during the final training (250 epochs).Also shown are the training and average inference times.The experiments were conducted on a workstation with an NVIDIA RTX A6000 (48GB) GPU.Bold face indicates best performance.

Model Absolute error in area (mm 2 ) Absolute error in AD (10
standard development process necessitates multiple training runs due to the need for hyperparameter tuning and experimenting with various model architectures.To get a better handle on what the full DL model development pipeline might look like in terms of energy usage and carbon footprint, a recent study found that the process of building and testing a final paper-worthy DL model required training 4789 models over a 6-month period Vol:.(1234567890) Scientific Reports | (2023) 13:21794 | https://doi.org/10.1038/s41598-023-48986-6www.nature.com/scientificreports/